SRE-3704 ci: Fault injection testing stage on VM/bare metal#17953
276641f to 23827b4
e724b71 to 14b4ae9
Test stage Test RPMs on EL 9.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17953/56/execution/node/424/log
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17953/56/display/redirect
eea6d40 to 8bf0a15
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17953/62/execution/node/369/log
3591870 to fdc56a7
Test stage Unit Test completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17953/69/testReport/
dd13687 to 07365a4
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17953/76/testReport/
45aa283 to 51c7487
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Skip-python-bandit: true Skip-unit-tests:true Skip-unit-test: true Skip-NLT: true Skip-unit-test-memcheck: true Skip-func-vm: true Skip-test-el-9.6-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Skip-python-bandit: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-build-leap15-icc: true Skip-unit-tests:true Skip-unit-test: true Skip-NLT: true Skip-unit-test-memcheck: true Skip-func-test-el8: true Skip-func-test-el9: true Skip-func-test-leap15: true Skip-test-el-9.6-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Cancel-prev-build: false Skip-python-bandit: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-build-leap15-icc: true Skip-unit-tests:true Skip-unit-test: true Skip-NLT: true Skip-unit-test-memcheck: true Skip-func-test-el9: true Skip-test-el-9.6-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Cancel-prev-build: false Skip-python-bandit: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-build-leap15-icc: true Skip-unit-tests:true Skip-unit-test: true Skip-NLT: true Skip-unit-test-memcheck: true Skip-func-test-el9: true Skip-test-el-9.6-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Cancel-prev-build: false Priority: 2
Run NLT and Fault Injection Tests on dedicated VMs with 64GiB of memory reserved. Limit NLT memory to 16GiB. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true
Test stage NLT completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17953/92/display/redirect
Test stage NLT Fault injection testing completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17953/92/display/redirect
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
This reverts commit 2323fd9. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
Fault injection must have NLT in stage name Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
This reverts commit b03decb. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
d1a18e4 to bdd0209
Fix access rights for nlt_logs Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
mprotect-based Argobots ULT stack overflow checking causes a TLB shootdown IPI on every stack allocation/deallocation. On KVM hosts running multiple VMs in parallel this results in VM exits across all vCPUs, significantly increasing latency under concurrent load. Remove the setting to use the default (no overflow check), which is acceptable for a CI/test environment where crashes are already caught by the test harness. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
….yaml" This reverts commit dd9c9c0. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
…r.yaml" This reverts commit adcac00. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
…Test-FI Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
- Add fallback `fault_status` detection: if the primary detection via `$PREFIX/bin` fails, try resolving `fault_status` via `$PATH`, improving robustness when the binary is installed via RPM rather than built in-tree. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true Skip-func-test-leap15: true
This PR introduces logic that simplifies the Fault Injection testing stage in CI (the Jenkinsfile)
by moving it from a Docker container environment to a provisioned VM/bare metal environment.
Requires:
or
or
Background
The old `Fault injection testing` stage ran NLT fault injection inside a Docker container
(`docker_runner_fi` + `Dockerfile.el.9`) on a shared Jenkins agent host. Up to 10 FI
containers could execute simultaneously on the same host alongside other CI workloads,
resulting in severe CPU and network resource contention. The symptoms were well-documented:
RPC timeouts, SWIM protocol failures to make progress, and "Sluggish EC boundary" warnings,
all caused by infrastructure overload rather than real code defects.
Additionally, `nlt_server.yaml` had `ABT_STACK_OVERFLOW_CHECK=mprotect` set, which causes Argobots to issue `mprotect()` calls for ULT stack overflow detection. On KVM-based VMs, each such call triggers TLB shootdown IPIs across all vCPUs, making test execution significantly slower on VMs than on bare metal or inside Docker containers, where this overhead is less pronounced. This was a known cause of very long and unpredictable FI test execution times when running on VMs. Now that the stage runs on a dedicated provisioned VM with proper resources, `ABT_STACK_OVERFLOW_CHECK=mprotect` is removed from `nlt_server.yaml`, restoring test execution duration comparable to bare metal.
Two workarounds were introduced to mask this instability:
- #17959 (DAOS-623 test: add allowed error for FI, commit e0fd4e3): added `skip_substrings` filters in `node_local_test.py` and `cart_logtest.py` to suppress SWIM/network-related error conditions ("sluggish ec boundary report from rank", "sluggish stable epoch reporting", "progress callback was not called for too long", "rpc failed; rc:") that were firing due to Docker resource contention.
- #17999 (DAOS-623 test: ignore the server errors in client FI tests too): extended the same suppression to server-side errors seen in NLT client FI runs.
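The substring suppression described above can be illustrated with a minimal sketch. It is not the code from `cart_logtest.py` or `node_local_test.py`: the function names are hypothetical, and the case-insensitive matching is an assumption made for the example. Only the four substrings are taken from the description.

```python
# Hypothetical sketch of substring-based error suppression.
# Only the four substrings come from the PR description; everything else is illustrative.
SKIP_SUBSTRINGS = (
    "sluggish ec boundary report from rank",
    "sluggish stable epoch reporting",
    "progress callback was not called for too long",
    "rpc failed; rc:",
)


def is_suppressed(message, skip_substrings=SKIP_SUBSTRINGS):
    """Return True if the message contains any suppressed substring.

    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    lowered = message.lower()
    return any(fragment in lowered for fragment in skip_substrings)


def reportable_errors(messages, skip_substrings=SKIP_SUBSTRINGS):
    """Return only the messages that are not suppressed."""
    return [msg for msg in messages if not is_suppressed(msg, skip_substrings)]
```

Passing an empty `skip_substrings` to `reportable_errors` returns every message unchanged, which corresponds to the full error checking that reverting the workarounds restores.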
Both PRs were explicitly described as temporary workarounds, with the expectation that they
would be reverted once FI testing was moved to a dedicated, stable environment. This PR
delivers that fix and reverts both workarounds (e0fd4e3 / #17959 and #17999), restoring
full error checking in `node_local_test.py` and `cart_logtest.py`.
Solution
The `NLT Fault Injection testing` stage now runs on a dedicated provisioned VM (`CI_FI_1_LABEL`, default `ci_fi_vm1`) using the same `unitTest`/`unitTestPost` pipeline procedures as the NLT and Unit Test stages. This mirrors how NLT tests have always been run (on bare metal/VM nodes exclusively allocated for that purpose) and brings the same benefits to FI testing:
- … (`VM_CPUS=20` in pipeline-lib) eliminate the resource contention that caused SWIM and RPC failures. With 20+ cores, `AllocFailTest.launch()` can run FI tests in parallel (`max_child = 15`) instead of the forced serial mode (`max_child = 1`) that occurred when the Docker container saw fewer than 20 vCPUs.
- `ABT_STACK_OVERFLOW_CHECK=mprotect` is removed from `utils/nlt_server.yaml`, eliminating the cascading TLB shootdown IPIs that occurred when multiple FI containers ran simultaneously on a shared KVM host.
- … Docker containers on a shared host.
- … removing a full SCons build from the critical path and significantly reducing stage runtime.
- Full error checking is restored in `node_local_test.py` and `cart_logtest.py`; the `skip_substrings` suppression introduced in #17959 (DAOS-623 test: add allowed error for FI) and #17999 (DAOS-623 test: ignore the server errors in client FI tests too) is removed.
- … `unitTestPost` path, consistent with all other test stages.
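The parallelism rule mentioned in the list above (parallel runs with 20 or more cores, serial otherwise) can be sketched as follows. This is an illustration under stated assumptions, not the body of `AllocFailTest.launch()`: the function and constant names are hypothetical, and only the threshold of 20 and the values 15 and 1 come from the description.

```python
import os

# Values taken from the PR description; the names are hypothetical.
PARALLEL_CPU_THRESHOLD = 20  # "20+ cores"
PARALLEL_CHILDREN = 15       # max_child when enough cores are visible
SERIAL_CHILDREN = 1          # forced serial mode otherwise


def pick_max_child(cpu_count=None):
    """Choose how many FI child processes may run concurrently.

    If cpu_count is not given, use the number of CPUs the process can see.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    if cpu_count >= PARALLEL_CPU_THRESHOLD:
        return PARALLEL_CHILDREN
    return SERIAL_CHILDREN
```

Under this rule a VM provisioned with `VM_CPUS=20` lands exactly on the parallel branch, while an environment that exposes 19 or fewer CPUs falls back to one child at a time.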
The stage is renamed from `Fault injection testing` to `NLT Fault Injection testing` to
avoid confusion with the existing `Fault injection testing` stage and to enable detection
in `parseStageInfo`/`skipStage` in pipeline-lib.
Jenkinsfile:
- Replace the `Fault injection testing` stage (Docker build + `nlt_test()`) with the new `NLT Fault Injection testing` stage running on a provisioned VM via `unitTest`.
- Remove the `nlt_test()` helper function entirely; its logic is now handled by `unitTest`/`unitTestPost` in pipeline-lib.
- Add the `CI_FI_1_LABEL` parameter (`ci_fi_vm1`) for the new FI VM pool; rename the `CI_NLT_1_LABEL` default from `ci_nlt_1` to `ci_nlt_vm1`.
- Remove the `fault-inject-valgrind` stash from `valgrindReportPublish`; FI runs with `--memcheck no` and produces no memcheck XML.
ci/docker_nlt.sh:
- … via the standard `unitTest` path.
ci/provisioning/post_provision_config_common_functions.sh:
- … `maldet` on provisioned nodes; `maldet` scans add CPU load during NLT tests.
ci/unit/test_nlt.sh:
- Replace `ssh -tt` + inline heredoc execution with `ssh -T … bash -s -- $*`, piping `test_nlt_node.sh` over stdin, so that command-line arguments (`$*`) are forwarded correctly to `test_nlt_node.sh` (required for the `--memcheck no --class-name fault-injection fi` arguments passed by the FI stage).
ci/unit/test_nlt_node.sh:
- Remove `sudo mkdir -p /mnt/daos` (no longer needed on provisioned VMs).
- … `$*`; default to the original NLT run parameters when no arguments are given, making the script reusable for both plain NLT and FI.
- Mount a `tmpfs` on `nlt_logs/` and set `TMPDIR` to it before executing `node_local_test.py`, so NLT log files land on a fast in-memory filesystem.
- Use `exec env` to set `HTTPS_PROXY`/`NO_PROXY` cleanly.
ci/unit/test_nlt_post.sh:
- Add an `rsync` pass to also collect logs from `build/nlt_logs/` on the node (NLT with `--no-root` writes logs there instead of `/tmp/`).
- Make `rsync` calls non-fatal (`|| true`) so post steps do not fail on missing log directories.
utils/nlt_server.yaml:
- Remove `ABT_STACK_OVERFLOW_CHECK=mprotect` from engine `env_vars`; the mprotect-based ULT stack overflow detection is no longer needed and was a source of TLB shootdown overhead on shared KVM hosts.
utils/node_local_test.py:
- Remove the `skip_substrings` workaround block (revert of #17959 / e0fd4e3 and #17999): "sluggish ec boundary report from rank", "sluggish stable epoch reporting", "progress callback was not called for too long", "rpc failed; rc:" are no longer suppressed, since these conditions should not occur on a dedicated VM.
- Add fallback `fault_status` detection: if the initial detection fails, try `fault_status` on `$PATH` and then `/usr/bin/fault_status` before giving up, improving robustness when the binary is installed via RPM rather than built in-tree.
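The fallback order described above (`$PREFIX/bin`, then `$PATH`, then `/usr/bin/fault_status`) can be sketched as a small lookup. This is a hedged illustration rather than the code added to `node_local_test.py`: the function name `find_fault_status` and its parameters are hypothetical.

```python
import os
import shutil


def find_fault_status(prefix, path=None):
    """Return the first executable fault_status found, or None.

    Search order (as described in the PR text):
      1. <prefix>/bin/fault_status  (in-tree build)
      2. fault_status resolved via PATH (or the given search path)
      3. /usr/bin/fault_status      (RPM install location)
    """
    candidates = [os.path.join(prefix, "bin", "fault_status")]
    on_path = shutil.which("fault_status", path=path)
    if on_path:
        candidates.append(on_path)
    candidates.append("/usr/bin/fault_status")
    for candidate in candidates:
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

Returning `None` instead of raising leaves the "giving up" decision to the caller, which can then skip or fail the fault-injection run as it sees fit.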
src/tests/ftest/cart/util/cart_logtest.py:
- Remove `self.skip_substrings = []` and the associated substring-suppression check block (revert of the `cart_logtest.py` portion of #17959 / #17999), restoring full log error detection.